5.3 Likelihood-Based Inference

1 Likelihood-Based Inference

Setting: $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_\theta(x)$, where $p_\theta(x)$ is "smooth" in $\theta$. Assume $E_\theta \nabla \ell_1(\theta;X_i) = 0$, that $\mathrm{Var}_\theta[\nabla \ell_1(\theta;X_i)] = -E_\theta \nabla^2 \ell_1(\theta;X_i) = J_1(\theta)$ is positive definite (the Fisher information), and that the MLE is consistent: $\hat\theta_{\mathrm{MLE}} \overset{p_\theta}{\to} \theta$. Then if $\theta = \theta_0$,
$$\frac{1}{\sqrt{n}} \nabla \ell_n(\theta_0;X) \Rightarrow N_d(0, J_1(\theta_0)), \qquad -\frac{1}{n} \nabla^2 \ell_n(\theta_0;X) \overset{p}{\to} J_1(\theta_0).$$
Since $0 = \nabla \ell_n(\hat\theta_n) \approx \nabla \ell_n(\theta_0) + \nabla^2 \ell_n(\theta_0)(\hat\theta_n - \theta_0)$, we get $\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, J_1(\theta_0)^{-1})$, and we can use this for inference on $\theta_0$!
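As a sanity check, here is a minimal simulation sketch of this limit (the Exponential-rate model, sample sizes, and variable names are illustrative assumptions, not from the notes):

```python
# Check sqrt(n)(theta_hat - theta0) ~ N(0, J1(theta0)^{-1}) by simulation.
# Assumed model: X ~ Exp(rate theta), so the MLE is 1/mean(X) and
# J1(theta) = 1/theta^2, giving limiting variance theta0^2.
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 1000, 2000
X = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1 / X.mean(axis=1)            # MLE, one per replication
Z = np.sqrt(n) * (theta_hat - theta0)     # should be approx N(0, theta0^2)
print(Z.mean(), Z.var())                  # mean ~ 0, variance ~ 4
```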

1.1 Wald-Type Confidence Regions

Assume we have some estimator $\hat J_n \succeq 0$ such that $\frac{1}{n}\hat J_n \overset{p}{\to} J_1(\theta_0) \succ 0$; then we plug in: if $\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, J_1(\theta_0)^{-1})$, then $J_1(\theta_0)^{1/2}\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, I_d)$, so by Slutsky's theorem $\hat J_n^{1/2}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, I_d)$. This leads to a test of $H_0: \theta = \theta_0$: $\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2 \Rightarrow \chi^2_d$, so $P_{\theta_0}\big(\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2 > \chi^2_d(\alpha)\big) \to \alpha$.
So we reject $\theta_0$ iff $\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2 > \chi^2_d(\alpha)$; equivalently, reject $\theta_0$ iff $\theta_0 \notin \hat\theta_n + \hat J_n^{-1/2} B_{\sqrt{\chi^2_d(\alpha)}}(0)$, the confidence ellipsoid. Note $\hat J_n^{-1/2} = O_p(1/\sqrt{n}) \approx n^{-1/2} J_1(\theta_0)^{-1/2}$.
For $d = 1$, we reject $\theta_0$ iff $|\hat J_n^{1/2}(\hat\theta_n - \theta_0)| > z_{\alpha/2}$; equivalently, reject $\theta_0$ iff $\theta_0 \notin \hat\theta_n \pm \hat J_n^{-1/2} z_{\alpha/2}$.

More information $\Rightarrow$ smaller ellipsoid (it shrinks like $1/\sqrt{n}$).
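A minimal sketch of the $d = 1$ interval, assuming a Bernoulli($p$) model (an illustrative choice) with $\hat J_n = n/(\hat p(1-\hat p))$:

```python
# d = 1 Wald interval: theta_hat +/- J_hat^{-1/2} z_{alpha/2},
# for Bernoulli(p) data where J_hat = n / (p_hat (1 - p_hat)).
import numpy as np
from scipy.stats import norm

def wald_ci_bernoulli(x, alpha=0.05):
    n, p_hat = len(x), np.mean(x)
    J_hat = n / (p_hat * (1 - p_hat))     # estimated Fisher information
    half = norm.ppf(1 - alpha / 2) / np.sqrt(J_hat)
    return p_hat - half, p_hat + half

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)
print(wald_ci_bernoulli(x))               # interval near 0.3
```

Note the endpoints are not constrained to $[0,1]$, one of the disadvantages listed below.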

Options for $\hat J_n$:

  1. Most obvious: plug the MLE into $J_n(\theta)$: $\hat J_n = J_n(\hat\theta_n) = \mathrm{Var}_\theta(\nabla \ell_n(\theta;X))\big|_{\theta=\hat\theta_n} \equiv \mathrm{Var}_{\hat\theta_n}(\nabla \ell_n(\hat\theta_n;X))$, or $\hat J_n = -E_\theta \nabla^2 \ell_n(\theta)\big|_{\theta=\hat\theta_n}$.
  2. Observed Fisher information: $\hat J_n = -\nabla^2 \ell_n(\hat\theta_n;X)$.

Both satisfy $\frac{1}{n}\hat J_n \overset{p}{\to} J_1(\theta_0) = \frac{1}{n}J_n(\theta_0)$ (under regularity: continuous second derivative, consistent MLE).
Both make sense outside of the i.i.d. setting, where $\hat J_n / J_n \overset{p}{\to} 1$.
Heuristically, the plug-in estimate measures the information about $\theta$ in a "typical" data set, while the observed information measures the information about $\theta$ in "this" data set.
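A sketch contrasting the two options, for a Cauchy location family where they genuinely differ (the model choice and the closed-form $J_1 = 1/2$ per observation are assumptions for illustration):

```python
# Option 1 (plug-in): J_hat = n * J1(theta_hat), with J1 = 1/2 for the
# Cauchy location model. Option 2 (observed): J_hat = -l_n''(theta_hat).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.standard_cauchy(300) + 1.0        # true location theta0 = 1

negloglik = lambda t: np.sum(np.log1p((x - t) ** 2))
theta_hat = minimize_scalar(negloglik, bounds=(-5, 5), method="bounded").x

u = x - theta_hat
plug_in = len(x) * 0.5                    # info in a "typical" data set
observed = np.sum(2 * (1 - u ** 2) / (1 + u ** 2) ** 2)  # info in "this" one
print(theta_hat, plug_in, observed)       # close, but not equal
```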

Wald interval for $\theta_j$: $\hat\theta_n \approx N_d(\theta_0, J_n(\theta_0)^{-1})$, so $\hat\theta_{n,j} \approx N\big(\theta_{0,j}, (J_n(\theta_0)^{-1})_{jj}\big)$ with $(J_n(\theta_0)^{-1})_{jj} = \mathrm{se}(\hat\theta_{n,j})^2$, giving $c_j = \hat\theta_{n,j} \pm \sqrt{(\hat J_n^{-1})_{jj}}\; z_{\alpha/2}$.
Confidence ellipsoid for a subset: $\theta_{0,S} = (\theta_{0,j})_{j \in S}$, $|S| = k \le d$. Then $\hat\theta_{n,S} \approx N_k\big(\theta_{0,S}, (J_n(\theta_0)^{-1})_{SS}\big)$, so $c_S = \hat\theta_{n,S} + \big((\hat J_n^{-1})_{SS}\big)^{1/2} B_{\sqrt{\chi^2_k(\alpha)}}(0)$.
More generally, if $\hat\theta_n$ is any consistent estimator with $\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, \Sigma(\theta_0))$, and we have $\hat\Sigma_n$ with $n\hat\Sigma_n \overset{p_{\theta_0}}{\to} \Sigma(\theta_0) \succ 0$, then $\hat\Sigma_n^{-1/2}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, I_d)$. ($\hat\theta_n$ need not be the MLE.)
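A generic sketch of the coordinate-wise intervals above, taking $\hat\theta_n$ and $\hat J_n$ as given (all names and numbers here are hypothetical):

```python
# c_j = theta_hat_j +/- sqrt((J_hat^{-1})_{jj}) * z_{alpha/2}, for each j.
import numpy as np
from scipy.stats import norm

def wald_intervals(theta_hat, J_hat, alpha=0.05):
    se = np.sqrt(np.diag(np.linalg.inv(J_hat)))   # sqrt((J_hat^{-1})_jj)
    z = norm.ppf(1 - alpha / 2)
    return np.column_stack([theta_hat - z * se, theta_hat + z * se])

# hypothetical 2-parameter fit:
print(wald_intervals(np.array([1.0, -0.5]),
                     np.array([[400.0, 50.0], [50.0, 100.0]])))
```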

By the same Taylor expansion of $\ell_n$ (e.g. for regression coefficients $\beta$): $\hat J_n^{1/2}(\hat\beta_n - \beta_0) \Rightarrow N_d(0, I_d)$, where $\hat J_n = -\nabla^2 \ell_n(\hat\beta_n;X)$.


Advantages and Disadvantages
Advantages:

  1. Easy to invert, simple confidence regions.
  2. Asymptotically correct.

Disadvantages:

  1. Have to compute the MLE.
  2. Depends on parameterization.
  3. Relies on two approximations: $\nabla\ell_n \approx$ normal, $\ell_n \approx$ quadratic.
  4. Needs the MLE to be consistent.
  5. Confidence interval/ellipsoid can extend outside $\Theta$.

1.2 Score Test

Test $H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$.
We can bypass the quadratic approximation entirely by using the score as the test statistic:

$$\frac{1}{\sqrt{n}} \nabla \ell_n(\theta_0;X) \overset{P_{\theta_0}}{\Rightarrow} N_d(0, J_1(\theta_0)), \quad \text{or} \quad J_n(\theta_0)^{-1/2} \nabla \ell_n(\theta_0;X) \overset{P_{\theta_0}}{\Rightarrow} N_d(0, I_d).$$

We reject $H_0: \theta = \theta_0$ if $\|J_n(\theta_0)^{-1/2} \nabla \ell_n(\theta_0;X)\|^2 \ge \chi^2_d(\alpha)$; for $d = 1$, $\dot\ell_n(\theta_0)/\sqrt{J_n(\theta_0)} \Rightarrow N(0,1)$, so we can also do one-sided tests.
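A minimal sketch of the $d = 1$ score test in a Poisson model (an illustrative choice); note that, unlike the Wald test, no MLE is needed since everything is evaluated at the null value:

```python
# Score test of H0: lambda = lambda0 for X_i ~ Poisson(lambda):
# l_n'(lambda0) = sum(x)/lambda0 - n, J_n(lambda0) = n/lambda0.
import numpy as np
from scipy.stats import norm

def poisson_score_test(x, lam0):
    n = len(x)
    score = x.sum() / lam0 - n            # score at the null value
    J_n = n / lam0                        # Fisher information at lambda0
    z = score / np.sqrt(J_n)              # approx N(0,1) under H0
    return z, 2 * norm.sf(abs(z))         # two-sided p-value

rng = np.random.default_rng(3)
x = rng.poisson(2.0, size=200)
print(poisson_score_test(x, lam0=2.0))    # H0 true: z small, p large
```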

This can be generalized to the case with nuisance parameters; typically they are estimated via the MLE over $\Theta_0$.

The score test is invariant to reparameterization: assume $d = 1$, $\theta = g(\xi)$ with $g'(\xi) > 0$ for all $\xi$, and let $q_\xi(x) = p_{g(\xi)}(x)$. Then $\dot\ell^{(\xi)}(\xi;x) = \frac{d}{d\xi} \log p_{g(\xi)}(x) = \dot\ell^{(\theta)}(g(\xi);x)\, g'(\xi)$ and $J^{(\xi)}(\xi) = J^{(\theta)}(g(\xi))\, g'(\xi)^2$, so $\frac{\dot\ell^{(\xi)}(\xi_0;X)}{\sqrt{J^{(\xi)}(\xi_0)}} \overset{\text{a.s.}}{=} \frac{\dot\ell^{(\theta)}(\theta_0;X)}{\sqrt{J^{(\theta)}(\theta_0)}}$ if $\theta_0 = g(\xi_0)$.

Example (multinomial): let $N = (N_1,\dots,N_d) \sim \mathrm{Multinomial}(n, \pi)$. Note $\sum_{i=1}^d \pi_i = 1$, so this is a full-rank $(d-1)$-parameter exponential family, e.g. with natural parameter $\eta$ and
$$\pi_j = \begin{cases} \dfrac{1}{1 + \sum_{k>1} e^{\eta_k}}, & j = 1, \\[4pt] \dfrac{e^{\eta_j}}{1 + \sum_{k>1} e^{\eta_k}}, & j > 1. \end{cases}$$
So $\nabla \ell_n = (N_2,\dots,N_d) - (n\pi_2,\dots,n\pi_d)$,
$$\mathrm{Var}_\eta(\nabla \ell_n(\eta)) = \begin{pmatrix} n\pi_2(1-\pi_2) & \cdots & -n\pi_i\pi_j \\ \vdots & \ddots & \vdots \\ -n\pi_i\pi_j & \cdots & n\pi_d(1-\pi_d) \end{pmatrix} = n\big(\mathrm{diag}(\pi_{2:d}) - \pi_{2:d}\pi_{2:d}^T\big),$$
$$J_n(\eta)^{-1} = \frac{1}{n}\Big(\mathrm{diag}(\pi_{2:d})^{-1} + \frac{1}{\pi_1}\mathbf{1}\mathbf{1}^T\Big).$$
Here we use the Sherman-Morrison formula $(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$. So the score test statistic for $H_0: \pi = \pi_0$ is
$$\nabla \ell_n(\eta_0)^T J_n^{-1}(\eta_0)\, \nabla \ell_n(\eta_0) = \sum_{j=1}^d \frac{(N_j - n\pi_{0j})^2}{n\pi_{0j}} \overset{P_{\pi_0}}{\Rightarrow} \chi^2_{d-1},$$
which is Pearson's chi-square statistic.
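A quick numeric sketch confirming the identity, cross-checked against scipy's Pearson chi-square implementation (the counts here are simulated for illustration):

```python
# The multinomial score statistic equals Pearson's chi-square.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(4)
n, pi0 = 1000, np.array([0.2, 0.3, 0.5])
N = rng.multinomial(n, pi0)               # observed counts N_1..N_d
expected = n * pi0                        # n * pi_{0j}

stat = np.sum((N - expected) ** 2 / expected)
print(stat, chisquare(N, expected))       # same statistic, df = d - 1 = 2
```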

2 Generalized LRT

Test $H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$. Taylor expand around $\hat\theta_n$:

$$\ell_n(\theta_0) - \ell_n(\hat\theta_n) = \nabla\ell_n(\hat\theta_n)^T(\theta_0 - \hat\theta_n) + \tfrac{1}{2}(\theta_0 - \hat\theta_n)^T \nabla^2 \ell_n(\tilde\theta_n)(\theta_0 - \hat\theta_n) = -\tfrac{1}{2}\Big\|\Big(-\tfrac{1}{n}\nabla^2\ell_n(\tilde\theta_n)\Big)^{1/2} \sqrt{n}\,(\theta_0 - \hat\theta_n)\Big\|^2 \Rightarrow -\tfrac{1}{2}\chi^2_d,$$
using $\nabla\ell_n(\hat\theta_n) = 0$, where $\tilde\theta_n$ lies between $\theta_0$ and $\hat\theta_n$.

Test statistic: $2\big(\ell_n(\hat\theta_n;X) - \ell_n(\theta_0;X)\big) \overset{P_{\theta_0}}{\Rightarrow} \chi^2_d$.
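A minimal sketch of this statistic for a Poisson null $H_0: \lambda = \lambda_0$ (an assumed example; the factorial terms cancel in the log-likelihood difference):

```python
# Generalized LRT statistic 2(l_n(lambda_hat) - l_n(lambda0)), compared
# to chi-square(1) since d = 1 here.
import numpy as np
from scipy.stats import chi2

def poisson_lrt(x, lam0):
    lam_hat = x.mean()                    # unrestricted MLE
    stat = 2 * (x.sum() * np.log(lam_hat / lam0) - len(x) * (lam_hat - lam0))
    return stat, chi2.sf(stat, df=1)      # statistic and p-value

rng = np.random.default_rng(5)
x = rng.poisson(2.0, size=200)
print(poisson_lrt(x, lam0=2.0))           # H0 true: stat small, p large
```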


Consider $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta \setminus \Theta_0$; assume $\Theta_0$ is (locally, near $\theta_0 \in \Theta_0$) a smooth $d_0$-dimensional submanifold of $\Theta \subseteq \mathbb{R}^d$, with the same regularity conditions as above.

Then $2\big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_0)\big) \Rightarrow \chi^2_{d - d_0}$, where $\hat\theta_0 = \arg\max_{\theta \in \Theta_0} \ell_n(\theta;X)$ is the MLE under $H_0$.

Why? Assume WLOG $\theta_0 = 0$ and $J_1(0) = I_d$. Then $\hat\theta_n \approx N_d(\theta_0, \tfrac{1}{n} I_d)$, and locally $-\nabla^2 \ell_n(\theta) \approx n I_d$ near $\theta_0$, so $\ell_n(\theta) \approx \ell_n(\hat\theta_n) - \tfrac{n}{2}\|\theta - \hat\theta_n\|^2$. Hence $\hat\theta_0 \approx \arg\min_{\theta \in \Theta_0} \|\theta - \hat\theta_n\| = \mathrm{Proj}_{\Theta_0}(\hat\theta_n)$ and
$$2\big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_0)\big) \approx n\,\|\hat\theta_n - \mathrm{Proj}_{\Theta_0}(\hat\theta_n)\|^2 = n\,\|\mathrm{Proj}_{\Theta_0^\perp}(\hat\theta_n)\|^2 \Rightarrow \chi^2_{d - d_0}.$$

3 Asymptotic Equivalence

Recall the quadratic approximation picture ($d = 1$):

$$\ell_n(\theta) \approx \ell_n(\theta_0) + \dot\ell_n(\theta_0)(\theta - \theta_0) - \tfrac{1}{2} J_n(\theta_0)(\theta - \theta_0)^2.$$
For large $n$,
$$2\big(\ell_n(\hat\theta_n) - \ell_n(\theta_0)\big) \approx \big\|J_n(\theta_0)^{1/2}(\hat\theta_n - \theta_0)\big\|^2.$$

Then Wald: $\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2$; Score: $\|J_n(\theta_0)^{-1/2} \nabla \ell_n(\theta_0)\|^2$; LRT: $2(\ell_n(\hat\theta_n) - \ell_n(\theta_0))$. Under $H_0$, all three statistics are asymptotically equivalent.
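A numeric sketch of this equivalence for a Bernoulli model (an assumed example): under $H_0$ and large $n$, the three statistics nearly coincide:

```python
# Wald, score, and LRT statistics for H0: p = p0, all approx chi2(1).
import numpy as np

rng = np.random.default_rng(6)
p0, n = 0.3, 2000
x = rng.binomial(1, p0, size=n)
p_hat, s = x.mean(), x.sum()

wald  = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))   # info at p_hat
score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))         # info at p0
lrt   = 2 * (s * np.log(p_hat / p0)
             + (n - s) * np.log((1 - p_hat) / (1 - p0)))
print(wald, score, lrt)                   # all close to each other
```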

4 Asymptotic Relative Efficiency (ARE)

Suppose $\hat\theta_n^{(i)}$, $i = 1, 2$, are two asymptotically normal estimators of $\theta_0 \in \mathbb{R}$, with $\sqrt{n}(\hat\theta_n^{(i)} - \theta_0) \Rightarrow N(0, \sigma_i^2)$. The ARE of $\hat\theta^{(2)}$ w.r.t. $\hat\theta^{(1)}$ is $\sigma_1^2 / \sigma_2^2$. E.g., if $\sigma_2^2 = 2\sigma_1^2$ then $\hat\theta^{(2)}$ is 50% as efficient.
Interpretation: suppose $\sigma_1^2 / \sigma_2^2 = \gamma \in (0,1)$. Then for large $n$, $\hat\theta^{(1)}_{[\gamma n]}(X_1,\dots,X_{[\gamma n]}) \overset{D}{\approx} \hat\theta^{(2)}_n(X_1,\dots,X_n) \approx N(\theta_0, \sigma_2^2/n)$: using $\hat\theta^{(2)}$ is like throwing away $100(1-\gamma)\%$ of the data and then using $\hat\theta^{(1)}$.
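A simulation sketch of this for the classic mean-vs-median comparison under normal data (an assumed example: $\sigma_2^2 = \pi/2$ for the median, so the ARE is $2/\pi \approx 0.64$):

```python
# ARE of the sample median w.r.t. the sample mean for N(0, 1) data.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 5000
X = rng.normal(0.0, 1.0, size=(reps, n))
var_mean = n * X.mean(axis=1).var()        # -> sigma1^2 = 1
var_med  = n * np.median(X, axis=1).var()  # -> sigma2^2 = pi/2 ~ 1.57
print(var_mean / var_med)                  # ARE ~ 2/pi ~ 0.64
```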